Practical session on fitting distributions

F. Bertrand

2022-09-29

1 Introduction

1.1 Package installation

Let’s start by installing the packages we will need in the following.

if(!require(devtools)){install.packages("devtools");library(devtools)}
## Le chargement a nécessité le package : devtools
## Le chargement a nécessité le package : usethis
if(!require(ggmap)){install.packages("ggmap");library(ggmap)}
## Le chargement a nécessité le package : ggmap
## Le chargement a nécessité le package : ggplot2
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.

1.2 Recovery of hydrolic data

The data to be studied were presented in Dione (1992).

station_data <- read.csv("https://tinyurl.com/y2kosb2y")
longlat <- read.csv("https://tinyurl.com/yy9f9bsf")

The KIDIRA, GOURBA(ssi) and FADOU(gou) columns correspond to annual flows between 1903 and 1989. The columns GOUc and KILIc correspond to complementary hydrolic data.

The dms2dd function of the biogeo package allows to convert coordinates known as longitude and latitude expressed in degrees, minutes and seconds into decimal degrees.

if(!require(biogeo)){install.packages("biogeo");library(biogeo)}
## Le chargement a nécessité le package : biogeo
latdd<-with(longlat,dms2dd(LATDEG,LATMIN,0,LATNS))
latdd
## [1] 14.46667 12.80000 12.73333 12.26667 11.31667 10.75000 14.46667 13.40000
## [9] 12.51667
min(latdd)
## [1] 10.75
max(latdd)
## [1] 14.46667
longdd<-with(longlat,dms2dd(LONGDEG,LONGMIN,0,LONGOE))
min(longdd)
## [1] -12.3
max(longdd)
## [1] -11.01667
typesite <- (longlat$NATURE.DU.SITE=="STATIONS PLUVIOMETRIQUES")+1
df_dd_stations <- data.frame(cbind(latdd,longdd,typesite))

1.3 Graphic visualization of the position of the measuring stations.

Recovery of geographical data from the Internet.

bbox = bounding box.

min(longdd)
## [1] -12.3
max(longdd)
## [1] -11.01667
min(latdd)
## [1] 10.75
max(latdd)
## [1] 14.46667
faleme <- get_stamenmap(bbox = c(left = min(longdd)*1.05, bottom = min(latdd)/1.05, right = max(longdd)/1.05, top = max(latdd)*1.05), 
zoom = 8)
## Source : http://tile.stamen.com/terrain/8/118/117.png
## Source : http://tile.stamen.com/terrain/8/119/117.png
## Source : http://tile.stamen.com/terrain/8/120/117.png
## Source : http://tile.stamen.com/terrain/8/118/118.png
## Source : http://tile.stamen.com/terrain/8/119/118.png
## Source : http://tile.stamen.com/terrain/8/120/118.png
## Source : http://tile.stamen.com/terrain/8/118/119.png
## Source : http://tile.stamen.com/terrain/8/119/119.png
## Source : http://tile.stamen.com/terrain/8/120/119.png
## Source : http://tile.stamen.com/terrain/8/118/120.png
## Source : http://tile.stamen.com/terrain/8/119/120.png
## Source : http://tile.stamen.com/terrain/8/120/120.png

Several types of land are available.

maptypes = c("terrain", "terrain-background", "terrain-labels", 
             "terrain-lines", "toner", "toner-2010", "toner-2011", 
             "toner-background", "toner-hybrid", "toner-labels", 
             "toner-lines", "toner-lite", "watercolor")

Retrieve another example of a map terrain type.

faleme_toner2011 <- get_stamenmap(bbox = c(left = min(longdd)*1.05, bottom = min(latdd)/1.05, right = max(longdd)/1.05, top = max(latdd)*1.05), zoom = 8, maptype="toner-2011")
## Source : http://tile.stamen.com/toner-2011/8/118/117.png
## Source : http://tile.stamen.com/toner-2011/8/119/117.png
## Source : http://tile.stamen.com/toner-2011/8/120/117.png
## Source : http://tile.stamen.com/toner-2011/8/118/118.png
## Source : http://tile.stamen.com/toner-2011/8/119/118.png
## Source : http://tile.stamen.com/toner-2011/8/120/118.png
## Source : http://tile.stamen.com/toner-2011/8/118/119.png
## Source : http://tile.stamen.com/toner-2011/8/119/119.png
## Source : http://tile.stamen.com/toner-2011/8/120/119.png
## Source : http://tile.stamen.com/toner-2011/8/118/120.png
## Source : http://tile.stamen.com/toner-2011/8/119/120.png
## Source : http://tile.stamen.com/toner-2011/8/120/120.png

Adding the stations to the two types of maps of the region.

ggmap(faleme) + geom_point(
  aes(x = longdd, y = latdd, color=typesite),
  data = df_dd_stations, colour = c("red","green")[typesite], size = 3
)

ggmap(faleme_toner2011) + geom_point(
  aes(x = longdd, y = latdd, color=typesite),
  data = df_dd_stations, colour = c("red","green")[typesite], size = 3
)

2 Adjustments to hydrological data

if(!require(fitdistrplus)){install.packages("fitdistrplus");library(fitdistrplus)}
## Le chargement a nécessité le package : fitdistrplus
## Le chargement a nécessité le package : MASS
## Le chargement a nécessité le package : survival

2.1 KIDIRA

Représentations des données.

plotdist(station_data$KIDIRA) 

descdist(station_data$KIDIRA, boot = 1000)

## summary statistics
## ------
## min:  60   max:  963 
## median:  212 
## mean:  253.1034 
## estimated sd:  186.9003 
## estimated skewness:  2.230276 
## estimated kurtosis:  7.713073

2.1.1 Some examples of fitting distributions to data.

2.1.1.1 Log-normal

KIDIRA.ln <- fitdist(station_data$KIDIRA, "lnorm")
summary(KIDIRA.ln)
## Fitting of the distribution ' lnorm ' by maximum likelihood 
## Parameters : 
##          estimate Std. Error
## meanlog 5.3476352 0.06169314
## sdlog   0.5754353 0.04362305
## Loglikelihood:  -540.6132   AIC:  1085.226   BIC:  1090.158 
## Correlation matrix:
##         meanlog sdlog
## meanlog       1     0
## sdlog         0     1
plot(KIDIRA.ln)

2.1.1.2 Gumbel

if(!require(actuar)){install.packages("actuar");library(actuar)}
## Le chargement a nécessité le package : actuar
## 
## Attachement du package : 'actuar'
## Les objets suivants sont masqués depuis 'package:stats':
## 
##     sd, var
## L'objet suivant est masqué depuis 'package:grDevices':
## 
##     cm
KIDIRA.gumbel <- fitdist(station_data$KIDIRA, "gumbel", start=list(alpha=10, scale=10))
summary(KIDIRA.gumbel)
## Fitting of the distribution ' gumbel ' by maximum likelihood 
## Parameters : 
##       estimate Std. Error
## alpha 183.3946  11.244267
## scale 101.3473   9.242219
## Loglikelihood:  -548.6173   AIC:  1101.235   BIC:  1106.166 
## Correlation matrix:
##           alpha     scale
## alpha 1.0000000 0.2565498
## scale 0.2565498 1.0000000
plot(KIDIRA.gumbel)

2.1.1.3 Log-logistic

KIDIRA.ll <- fitdist(station_data$KIDIRA, "llogis", start = list(shape = 1, scale = 10))
summary(KIDIRA.ll)
## Fitting of the distribution ' llogis ' by maximum likelihood 
## Parameters : 
##         estimate Std. Error
## shape   3.199725  0.2905542
## scale 203.019858 11.6813127
## Loglikelihood:  -539.0256   AIC:  1082.051   BIC:  1086.983 
## Correlation matrix:
##            shape      scale
## shape  1.0000000 -0.0292221
## scale -0.0292221  1.0000000
plot(KIDIRA.ll)

2.1.1.4 Pareto

KIDIRA.P <- fitdist(station_data$KIDIRA, "pareto", start = list(shape = 1, scale = 10))
## Warning in sqrt(diag(varcovar)): Production de NaN
## Warning in sqrt(1/diag(V)): Production de NaN
## Warning in cov2cor(varcovar): diag(.) à 0 ou NA entrées ; les résultats non
## finis sont douteux
summary(KIDIRA.P)
## Fitting of the distribution ' pareto ' by maximum likelihood 
## Parameters : 
##         estimate Std. Error
## shape    9085979   4194.304
## scale 2300787437        NaN
## Loglikelihood:  -568.4405   AIC:  1140.881   BIC:  1145.813 
## Correlation matrix:
##       shape scale
## shape     1   NaN
## scale   NaN     1
plot(KIDIRA.P)

2.1.1.5 Burr

KIDIRA.B <- fitdist(station_data$KIDIRA, "burr", start = list(shape1 = 1, shape2 = 1))
## Warning in checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, : Some
## parameter names have no starting/fixed value but have a default value: rate,
## scale.
summary(KIDIRA.B)
## Fitting of the distribution ' burr ' by maximum likelihood 
## Parameters : 
##          estimate Std. Error
## shape1 0.07115278  0.1742363
## shape2 2.62656386  6.4320445
## Loglikelihood:  -698.1134   AIC:  1400.227   BIC:  1405.159 
## Correlation matrix:
##            shape1     shape2
## shape1  1.0000000 -0.9990415
## shape2 -0.9990415  1.0000000
plot(KIDIRA.B)

2.1.2 Graphical comparison of the fits

By their densities.

denscomp(lapply(paste("KIDIRA",c("ln","gumbel","ll","P","B"),sep="."),get), legendtext = c("lognormal","gumbel", "loglogistic", "Pareto", "Burr"))

By quantile-quantile diagrams.

qqcomp(lapply(paste("KIDIRA",c("ln","gumbel","ll","P","B"),sep="."),get), legendtext = c("lognormal","gumbel", "loglogistic", "Pareto", "Burr"))

By their distribution functions.

cdfcomp(lapply(paste("KIDIRA",c("ln","gumbel","ll","P","B"),sep="."),get), legendtext = c("lognormal","gumbel", "loglogistic", "Pareto", "Burr"))

By probability-probability diagrams.

ppcomp(lapply(paste("KIDIRA",c("ln","gumbel","ll","P","B"),sep="."),get), legendtext = c("lognormal","gumbel", "loglogistic", "Pareto", "Burr"))

2.1.3 Numerical comparison of fits

gofstat(lapply(paste("KIDIRA",c("ln","gumbel","ll","P","B"),sep="."),get), fitnames = c("lognormal","gumbel", "loglogistic", "Pareto", "Burr"))
## Goodness-of-fit statistics
##                              lognormal    gumbel loglogistic    Pareto
## Kolmogorov-Smirnov statistic 0.1134799 0.1244879  0.07528331 0.2882154
## Cramer-von Mises statistic   0.1778413 0.2931510  0.06870337 1.7262407
## Anderson-Darling statistic   1.3924941 2.6919756  0.81675664 9.1683672
##                                   Burr
## Kolmogorov-Smirnov statistic  0.535988
## Cramer-von Mises statistic    6.994845
## Anderson-Darling statistic   32.306880
## 
## Goodness-of-fit criteria
##                                lognormal   gumbel loglogistic   Pareto     Burr
## Akaike's Information Criterion  1085.226 1101.235    1082.051 1140.881 1400.227
## Bayesian Information Criterion  1090.158 1106.166    1086.983 1145.813 1405.159

2.2 Other distributions

The initial study suggests testing the fit to Galton’s (= lognormal), Goodrich’s, Pearson’s 3, Gumbel, …

2.3 Other techniques

Apply alternative distribution selection approaches to these data.

3 Applications to other variables.

plotdist(station_data$GOURBA) 

plotdist(station_data$FADOU) 

plotdist(station_data$GOUc) 

plotdist(station_data$KILIc) 

descdist(station_data$GOURBA, boot = 1000)

## summary statistics
## ------
## min:  81   max:  966 
## median:  186 
## mean:  264.8276 
## estimated sd:  221.6834 
## estimated skewness:  1.860305 
## estimated kurtosis:  5.103623
descdist(station_data$FADOU, boot = 1000)

## summary statistics
## ------
## min:  40   max:  998 
## median:  152 
## mean:  345.1379 
## estimated sd:  310.0986 
## estimated skewness:  0.8923851 
## estimated kurtosis:  2.102951

Valeurs annuelles cumulées.

descdist(station_data$GOUc, boot = 1000)

## summary statistics
## ------
## min:  320   max:  2681 
## median:  1449 
## mean:  1452.08 
## estimated sd:  517.403 
## estimated skewness:  0.09928642 
## estimated kurtosis:  2.506413
descdist(station_data$KILIc, boot = 1000)

## summary statistics
## ------
## min:  136   max:  3351 
## median:  1701 
## mean:  1690.644 
## estimated sd:  712.0299 
## estimated skewness:  0.02646055 
## estimated kurtosis:  2.52774

4 Mystery data

Which distribution(s) is/are compatible with this data?

donnees <- read.csv("https://tinyurl.com/yxn9udgt")
str(donnees)
## 'data.frame':    1000 obs. of  1 variable:
##  $ donnees: num  1.346 1.116 0.918 1.096 1.403 ...
 

A work by Frédéric Bertrand

frederic.bertrand@utt.fr